Network analysis is currently used in a myriad of contexts: from identifying potential drug targets to predicting the spread of epidemics and designing vaccination strategies, and from finding friends to uncovering criminal activity. Despite the promise of the network approach, the reliability of network data is a source of great concern in all fields where complex networks are studied. Here, we present a general mathematical and computational framework to deal with the problem of data reliability in complex networks. In particular, we are able to reliably identify both missing and spurious interactions in noisy network observations. Remarkably, our approach also enables us to obtain, from those noisy observations, network reconstructions that yield estimates of the true network properties that are more accurate than those provided by the observations themselves. Our approach has the potential to guide experiments, to better characterize network data sets, and to drive new discoveries.

Abbreviations: BM, stochastic block model; HRG, hierarchical random graph

The structure of the network of interactions between the units of a system affects the system's dynamics, and conveys information about the functional needs of the system, its evolution, and the role of individual units. For these reasons, network analysis has become a cornerstone of fields as diverse as systems biology and sociology [1]. Unfortunately, the reliability of network data is often a source of concern. In systems biology, high-throughput technologies hold the promise to uncover the intricate processes within the cell, but are also reportedly inaccurate. Protein interaction data provide, arguably, the most blatant example of data inaccuracy: in 2002, a systematic comparison of several high-throughput methods to a reference high-quality data set showed that these methods have accuracies below 20% [2]. Additionally, different methods result in networks that have different topological properties [3], and the coverage of real interactomes is very limited: 80% of the interactome of yeast [3] and 99.7% of the human interactome [4, 5] are still unknown.

In the social sciences, missing data due to individual non-response and dropout [6], informant inaccuracy [7], and sampling biases [8] are also pervasive. Simulation studies have established that these inaccuracies can lead to fundamentally wrong estimates of network properties and to misleading conclusions [8], which is particularly worrisome at a time when social network analysis is being used for finding new friends and partners, singling out key individuals in organizations, and identifying criminal activity.

Despite these concerns, the issue of network reliability has only been addressed on a field-by-field basis (for example, to deal with protein-protein interactions [9, 10] or to take into account informant inaccuracy in social networks [7]), and in studies that only address parts of the problem (for example, to detect missing interactions [11]). Therefore, a general framework to deal with the problem of data reliability in complex networks is lacking. Here, we develop such a framework. Specifically, we show that within our framework we can reliably: (i) identify false negatives (missing interactions) and false positives (spurious interactions), and (ii) generate, from a single observed network, a reconstructed network whose properties (clustering coefficient, modularity, assortativity, epidemic spreading threshold, and synchronizability, among others) are closer to those of the "true" underlying network than those of the observed network itself. We show that our approach outperforms previous attempts to predict missing and spurious interactions, and illustrate the potential of our method by applying it to a protein interaction network of yeast [12]. We end by discussing how our approach will help to guide experiments and new discoveries, and to better characterize important data sets.



General reliability formalism 

Consider an observed network with adjacency matrix $A^O$; $A^O_{ij} = 1$ if nodes $i$ and $j$ are connected and 0 otherwise. We assume that this observed network is a realization of an underlying probabilistic model, either because the network itself is the result of a stochastic process, because the measurement has uncertainty, or both [7].* Let us call $\mathcal{M}$ the set of generative models that could conceivably give rise to the observed network, and $p(M \mid A^O)$ the probability that $M \in \mathcal{M}$ is the model that gave rise to the observation $A^O$. If we could get a new observation of the network, the outcome would in general be different from $A^O$; our best estimate for the probability $p(X = x)$ for an arbitrary network property $X$ is

$$p(X = x \mid A^O) = \int_{\mathcal{M}} dM \, p(X = x \mid M) \, p(M \mid A^O), \qquad [1]$$

where $p(X = x \mid M)$ is the probability that $X = x$ in a network generated with model $M$. Using Bayes' theorem, we can rewrite Eq. [1] as

$$p(X = x \mid A^O) = \frac{\int_{\mathcal{M}} dM \, p(X = x \mid M) \, p(A^O \mid M) \, p(M)}{\int_{\mathcal{M}} dM' \, p(A^O \mid M') \, p(M')}, \qquad [2]$$

where $p(A^O \mid M)$ is the probability that model $M$ gives rise to $A^O$ among all possible adjacency matrices, and $p(M)$ is the a priori probability that model $M$ is the correct one. We call $p(X = x \mid A^O)$ the reliability of the $X = x$ measurement.
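To make Eq. [2] concrete, the following minimal sketch replaces the integral over models by a sum over a finite grid of models. The toy family (every pair of nodes linked independently with probability `q`), the observation (3 links among 10 node pairs), and all function and variable names are assumptions chosen for illustration; they are not part of the formalism above.

```python
def reliability(prob_x_given_model, likelihood, prior, models):
    """Discrete analogue of Eq. [2]: posterior-weighted average over candidate models."""
    weights = [likelihood(m) * prior(m) for m in models]
    evidence = sum(weights)
    return sum(prob_x_given_model(m) * w for m, w in zip(models, weights)) / evidence

# Hypothetical toy family: each pair of nodes is linked independently with probability q.
n_pairs, n_links = 10, 3                       # made-up observation
models = [k / 1000 for k in range(1001)]       # grid of candidate q values

estimate = reliability(
    prob_x_given_model=lambda q: q,            # X: "a given pair is linked" under model q
    likelihood=lambda q: q**n_links * (1 - q)**(n_pairs - n_links),
    prior=lambda q: 1.0,                       # no prior preference among models
    models=models,
)
print(estimate)
```

With a flat prior, the continuous version of this toy calculation gives $(l + 1)/(r + 2)$ for $l$ links among $r$ pairs, here $4/12$, and the grid sum should approach that value.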

Stochastic block models 

Given the generality of these arguments, the key to good estimates of reliability is to identify sets of models that are general, empirically grounded, and analytically or computationally tractable. Here, we focus on the family $\mathcal{M}_{BM}$ of stochastic block models [13, 14]. In a stochastic block model, nodes are partitioned into groups and the probability that two nodes are connected depends only on the groups to which they belong (Fig. 1).

* For simplicity, in this manuscript we use language that is consistent with a situation in which a true network exists but is obscured by the inaccuracies of the observation process. Thus, we talk about the "true" network, which has no "errors," and about "observed" networks, which have "errors." However, the formalism is valid even if the network is itself the outcome of a stochastic process.

Stochastic block models are empirically grounded in that they capture two ubiquitous and fundamental properties of real complex networks. First, nodes in real networks are often organized into modules or communities [15, 16], which may overlap [17] or be hierarchically nested [18, 19, 11], so that connections are relatively more abundant within modules than between modules. In most real-world networks, this modularity is significantly larger than expected from chance [20, 16]. Second, nodes in real networks fulfill distinct roles and connect to each other depending on these roles [13, 16]. Role-to-role connections are not necessarily assortative [21, 16], that is, nodes with a certain role may or may not tend to connect with other nodes with the same role. In general, stochastic block models are particularly appropriate when nodes belong to groups and interact with each other depending on their group membership (regardless of whether interactions occur mostly within groups or between groups).

Stochastic block models are also appropriate in that they can capture other more general connectivity correlations in the network. For example, if people establish social connections with others according to age, then a block model that partitions individuals into age groups will capture some of the correlations in the network.

In general, complex networks result from a combination of mechanisms, including modularity, role structure, and maybe other factors. Although partitions into modules, roles, and age groups, for example, can be very different from each other, some block model in the $\mathcal{M}_{BM}$ family is likely to capture each of them separately; by sampling over all models $M \in \mathcal{M}_{BM}$ we capture a variety of correlations, ideally to the exact degree that they are relevant.

Additionally, stochastic block models are analytically tractable because in a stochastic block model the probability that nodes $i$ and $j$ are connected depends only on the groups to which they belong [14]. Therefore, we can calculate the reliability of individual links and the reliability of entire networks.
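The definition of a stochastic block model can be sketched in a few lines. The example below draws one realization and counts, for every pair of groups, the observed number of links and the maximum possible number of links. The group sizes (4, 5, and 6) follow Fig. 1; the probabilities in `Q` and all function names are made up for illustration.

```python
import random
from itertools import combinations

def sample_block_model(groups, Q, rng):
    """Draw one undirected network: nodes i and j are linked with probability Q[g_i][g_j]."""
    n = len(groups)
    return {(i, j) for i, j in combinations(range(n), 2)
            if rng.random() < Q[groups[i]][groups[j]]}

def block_counts(groups, links):
    """Return (l, r): observed links and maximum possible links for each pair of groups."""
    l, r = {}, {}
    for i, j in combinations(range(len(groups)), 2):
        key = tuple(sorted((groups[i], groups[j])))
        r[key] = r.get(key, 0) + 1
        l[key] = l.get(key, 0) + ((i, j) in links)
    return l, r

# Three groups of sizes 4, 5, and 6; the entries of Q are illustrative assumptions.
groups = [0] * 4 + [1] * 5 + [2] * 6
Q = [[0.0, 0.9, 0.1],
     [0.9, 0.5, 0.3],
     [0.1, 0.3, 0.7]]
links = sample_block_model(groups, Q, random.Random(0))
l, r = block_counts(groups, links)
print(r[(0, 2)], l[(0, 2)])   # maximum and observed links between groups 0 and 2
```

For groups of 4 and 6 nodes the maximum number of between-group links is $4 \times 6 = 24$, and within a group of 4 nodes it is $4 \times 3 / 2 = 6$.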



Link reliability: missing and spurious interactions 

The reliability of an individual link is $R^L_{ij} = p_{BM}(A_{ij} = 1 \mid A^O)$, that is, the probability that the link "truly" exists given our observation of the whole network (and our choice of the family of stochastic block models). Assuming no prior knowledge about the suitability of the models, we obtain (Methods)

$$R^L_{ij} = \frac{1}{Z} \sum_{P \in \mathcal{P}} \left( \frac{l^O_{\sigma_i \sigma_j} + 1}{r_{\sigma_i \sigma_j} + 2} \right) \exp[-\mathcal{H}(P)], \qquad [3]$$

where the sum is over partitions $P$ in the space $\mathcal{P}$ of all possible partitions of the network into groups, $\sigma_i$ is node $i$'s group (in partition $P$), $l^O_{\alpha\beta}$ is the number of links in the observed network between groups $\alpha$ and $\beta$, and $r_{\alpha\beta}$ is the maximum possible number of links between $\alpha$ and $\beta$ (Fig. 1). The function $\mathcal{H}(P)$ is a function of the partition

$$\mathcal{H}(P) = \sum_{\alpha \le \beta} \left[ \ln(r_{\alpha\beta} + 1) + \ln \binom{r_{\alpha\beta}}{l^O_{\alpha\beta}} \right], \qquad [4]$$

and $Z = \sum_{P \in \mathcal{P}} \exp[-\mathcal{H}(P)]$.

In practice, it is not possible to sum over all partitions even for small networks.† However, since Eq. [3] has the same mathematical form as an ensemble average in statistical mechanics [22], one can use the Metropolis algorithm to correctly sample relevant partitions (that is, partitions that significantly contribute to the sum) and obtain estimates for the link reliability (Methods).
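For networks of only a handful of nodes, the sum in Eq. [3] can still be evaluated exactly, which is a useful check on any sampling scheme. The sketch below enumerates every partition of a 6-node toy network (203 partitions) and applies Eqs. [3] and [4] directly. It sums over unlabeled set partitions, which is one possible reading of "all possible partitions"; the toy network and all function names are assumptions made for illustration.

```python
from math import comb, exp, log

def set_partitions(n):
    """Yield every partition of nodes 0..n-1 as a list of group labels."""
    def extend(labels, n_groups):
        if len(labels) == n:
            yield list(labels)
            return
        for g in range(n_groups + 1):          # join an existing group or open a new one
            labels.append(g)
            yield from extend(labels, max(n_groups, g + 1))
            labels.pop()
    yield from extend([], 0)

def block_counts(labels, links):
    """Observed links l and possible links r for each pair of groups."""
    l, r = {}, {}
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            key = (min(labels[i], labels[j]), max(labels[i], labels[j]))
            r[key] = r.get(key, 0) + 1
            l[key] = l.get(key, 0) + ((i, j) in links)
    return l, r

def link_reliabilities(n, links):
    """Evaluate Eq. [3] exactly by summing over all set partitions (tiny networks only)."""
    Z = 0.0
    R = {(i, j): 0.0 for i in range(n) for j in range(i + 1, n)}
    for labels in set_partitions(n):
        l, r = block_counts(labels, links)
        H = sum(log(r[k] + 1) + log(comb(r[k], l[k])) for k in r)   # Eq. [4]
        w = exp(-H)
        Z += w
        for (i, j) in R:
            key = (min(labels[i], labels[j]), max(labels[i], labels[j]))
            R[(i, j)] += w * (l[key] + 1) / (r[key] + 2)
    return {pair: value / Z for pair, value in R.items()}

# Toy observation: two triangles joined by the link (2, 3).
links = {(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)}
R = link_reliabilities(6, links)
for pair, value in sorted(R.items(), key=lambda item: -item[1]):
    print(pair, round(value, 3))
```

Because the number of partitions grows so quickly with the number of nodes, this exhaustive version is only practical for toy examples.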

We use the link reliability to identify missing and spurious interactions in network observations. We evaluate the performance of our approach using five high-quality networks: the social network of interactions between people in a karate club [23], the social network of frequent associations between 62 dolphins [24], the air transportation network of Eastern Europe [25], the neural network of the nematode C. elegans [26], and the metabolic network of E. coli [27, 28]. All of these networks have been manually curated and are widely used in the literature as model systems. Therefore, in what follows we assume that each of these networks is the "true" network $A^T$ and is error-free. We then generate hypothetical observations $A^O$ by adding random connections to, or removing random connections from, $A^T$, and evaluate the ability of our approach to recover the features of the true network.‡

To quantitatively study missing interactions, we generate observed networks $A^O$ by removing random links from the true network $A^T$. We then estimate the link reliability $R^L_{ij}$ for each of these false negatives ($A^O_{ij} = 0$ and $A^T_{ij} = 1$), as well as for the true negatives ($A^O_{ij} = 0$ and $A^T_{ij} = 0$). We measure the algorithm's ability to identify missing interactions by ranking the reliabilities (in decreasing order) and calculating the probability that a false negative has a higher ranking than a true negative [11]. Similarly, we quantify the ability to identify spurious interactions by adding random links to the true network, obtaining and ranking the link reliabilities (again, in decreasing order), and calculating the probability that a false positive ($A^O_{ij} = 1$ and $A^T_{ij} = 0$) is ranked lower than a true positive ($A^O_{ij} = 1$ and $A^T_{ij} = 1$).
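The accuracy measure described above is the probability that a randomly chosen item of one class outranks a randomly chosen item of the other. A minimal sketch follows; the reliabilities are made-up numbers, the function name is hypothetical, and counting ties as one half is an assumption (it equals the expected outcome of breaking ties at random).

```python
def ranking_accuracy(should_rank_high, should_rank_low):
    """Probability that a random item of the first list outscores one of the second."""
    wins = 0.0
    for high in should_rank_high:
        for low in should_rank_low:
            if high > low:
                wins += 1.0
            elif high == low:
                wins += 0.5          # tie: equivalent on average to a random tie-break
    return wins / (len(should_rank_high) * len(should_rank_low))

# Missing interactions: false negatives should rank above true negatives.
false_negatives = [0.9, 0.6, 0.4]
true_negatives = [0.5, 0.3, 0.2, 0.1]
acc_missing = ranking_accuracy(false_negatives, true_negatives)

# Spurious interactions: true positives should rank above false positives.
true_positives = [0.8, 0.7]
false_positives = [0.7, 0.2]
acc_spurious = ranking_accuracy(true_positives, false_positives)

print(acc_missing, acc_spurious)
```

A value of 0.5 corresponds to random ranking, and a value of 1 to perfect separation of the two classes.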

In Fig. 2 we compare our approach to the hierarchical random graph (HRG) approach of Clauset et al. [11] and to a local algorithm based on the number of common neighbors between each pair of nodes [29, 11] (Methods; see Supporting Information, Fig. S1, for a comparison to other local algorithms). We find that, except for one network, our approach consistently outperforms all others at identifying both missing interactions and spurious interactions. Our approach is also the only one that performs consistently well for all networks (unlike local algorithms, which work well for some networks but very poorly for others) and for both missing and spurious interactions (unlike the HRG algorithm, which performs comparatively worse at detecting spurious interactions§). Our algorithm is also consistently the most accurate when applied to a number of model networks, including networks with hierarchically nested modules, networks with a strongly disassortative role structure, and non-modular scale-free networks (Supporting Information, Fig. S2). We find that only when the network is strictly a hierarchical random graph is the HRG approach slightly more accurate at predicting missing interactions than the BM approach (Supporting Information, Secs. 2 and 3). Remarkably, even for strict hierarchical random graphs, the BM approach is more accurate at identifying spurious interactions.



Network reliability and network reconstruction 

The success at detecting both missing and spurious interactions confirms that our approach is able to uncover the structural features of the true network $A^T$. The natural question is thus whether it is possible to "reconstruct" the observation $A^O$ to gain greater insight into the global structure of $A^T$. This is difficult because, in general, adding a few candidate missing interactions and removing a few candidate spurious interactions does not give satisfactory network reconstructions (one of the main problems being that one does not know, a priori, how many missing and spurious interactions there are).



† The number of distinct partitions of $N$ elements into groups is $\sum_{k=1}^{N} \frac{1}{k!} \sum_{l=1}^{k} \binom{k}{l} (-1)^{k-l} \, l^N$, which grows faster than any finite power of $N$.

‡ By adding and removing connections in this way, we are implicitly focusing on random errors; we discuss at the end how our approach can also deal with systematic (or, in general, correlated) errors.

§ A plausible explanation for this behavior is that, because in the HRG model most parameters are used to "fit" low-level features of the network (pairs of nodes, triplets of nodes, and so on), the HRG approach may overfit spurious links.
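Therefore, the first step toward network reconstruction is to obtain the network reliability $R^N_A = p_{BM}(A \mid A^O)$, that is, the probability that $A$ is the true network given our observation $A^O$ (and our choice of the family of stochastic block models). We obtain (Methods)

$$R^N_A = \frac{1}{Z} \sum_{P \in \mathcal{P}} h(A; A^O, P) \exp[-\mathcal{H}(P)], \qquad [5]$$

where

$$h(A; A^O, P) = \prod_{\alpha \le \beta} \frac{r_{\alpha\beta} + 1}{2 r_{\alpha\beta} + 1} \binom{r_{\alpha\beta}}{l^O_{\alpha\beta}} \binom{2 r_{\alpha\beta}}{l_{\alpha\beta} + l^O_{\alpha\beta}}^{-1}, \qquad [6]$$

$l_{\alpha\beta}$ is the number of links in $A$ between groups $\alpha$ and $\beta$, and $\mathcal{H}(P)$ and $Z$ are the same as in Eqs. [3] and [4]. Once more, we use the Metropolis algorithm to estimate $R^N_A$.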

Given the network reliability $R^N_A = p_{BM}(A \mid A^O)$, the expected value of a property $X$

$$\langle X \rangle = \sum_A X(A) \, R^N_A \qquad [7]$$

over all possible networks $A$ is a better estimate of $X(A^T)$ than $X(A^O)$. We find that in many situations $R^N_{A^T} > R^N_{A^O}$ (Supporting Information, Fig. S13), which means that, presented only with an inaccurate observation $A^O$ (and with the knowledge about complex networks embodied in the stochastic block model family), our approach is remarkably able to identify that $A^T$ is a more likely network than the observation $A^O$ itself. This confirms that, even without knowing $A^T$, it is possible to estimate a property $X(A^T)$ better than just by measuring that property on $A^O$ (that is, better than assuming $X(A^T) \approx X(A^O)$).

Since summing over all possible networks in Eq. [7] is prohibitive, we use the approximation $\langle X \rangle \approx X(A^R)$, where $A^R$ is the network that maximizes $R^N_A$ (in other words, $A^R$ is the maximum a posteriori estimate of $A$). The network $A^R$ is what we call a network reconstruction, and we claim that $X(A^R)$ is, in general, a better estimate of $X(A^T)$ than $X(A^O)$. In practice, we build reconstructions by heuristically maximizing $R^N_A$, starting from $A^O$ (Methods).

We test our network reconstruction approach by generating hypothetical observed networks $A^O$ from the true test networks $A^T$ described above. Each observation has a fraction of the true interactions removed (we call this fraction the observation error rate), and an identical number of random interactions added. In Fig. 3 we show the true air transportation network of Eastern Europe, as well as a hypothetical observation of this network (with an observation error rate of 20%) and the corresponding reconstruction. The reconstruction has 13% fewer missing and spurious interactions than the observation and, qualitatively, it appears that individual node properties (specifically, degree and betweenness centrality) are also better captured by the reconstruction.

However, from a systems perspective, global network properties are more relevant than local node-level features. Therefore, the ultimate goal is to generate network reconstructions whose global properties are closer to those of the true network than those of the observations. To quantitatively investigate whether our approach accomplishes this aim, we calculate six network properties (static and dynamic) for observations and for the corresponding reconstructions of the air transportation network of Eastern Europe, and compute the relative error with respect to the true value. As we show in Fig. 4, the reconstruction consistently improves the estimates of these properties. Only when the observed network contains less than 10% of errors is it better, for a few of the properties, to use the observed network rather than the reconstruction. We obtain similar results for other networks and other network properties (Supporting Information, Figs. S8-S12).



Application to a protein interaction network

As we have discussed before, protein interaction networks are among the networks that may benefit the most from our approach. Ultimately, only experiments can prove our results useful; such experiments are, however, beyond the scope of this work. Nevertheless, here we show how our approach can help in directing the effort to refine protein interaction data.

We consider the protein interaction network of yeast that Gavin et al. obtained using affinity purification and mass spectrometry (AP/MS) [12]. In AP/MS assays, a "bait" protein is used to detect "prey" proteins that interact with the bait directly or indirectly. Since bait and prey play different roles, we limit ourselves to the set of 991 proteins that are both viable baits and viable prey [10] (for example, we discard proteins that only appear as prey because prey-prey interactions cannot possibly be observed). We build a protein interaction network by connecting all pairs of proteins that are reported as a bait-prey pair at least once.¶ From this network, we obtain the reliability for all pairs of proteins.

We evaluate how successful our algorithm is by considering those proteins among the 991 in the network that have been used once, and only once, as bait (some proteins are used as bait in several independent assays). For a pair of these proteins A and B, a link in the network can represent two distinct situations: (i) the interaction was observed once (with A as bait and B as prey but not the other way around, or vice versa); (ii) the interaction was observed twice (both with A as bait and with B as bait). Since these experimentally "non-reproducible" and "reproducible" interactions (3113 and 867 interactions in our network, respectively) are encoded identically in the network, it is interesting to see if our algorithm assigns lower reliability to the former and higher reliability to the latter.

Remarkably, among the 100 interactions with the lowest link reliability according to our algorithm, only 5 are experimentally reproducible. Conversely, among the 100 interactions with the highest link reliability, as many as 65 are experimentally reproducible. The probabilities of observing by chance such a small number in the first case and such a large number in the latter case are $p_< = 3 \times 10^{-6}$ and $p_> = 2 \times 10^{-20}$, respectively. Our approach is therefore successfully separating interactions that are likely to be spurious from those that are likely to be correct, without using any biophysical or biochemical information.
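The chance probabilities for the counts quoted above (5 and 65 reproducible interactions out of 100, with 867 reproducible interactions among 3113 + 867 in total) can be sketched as binomial tails. The text does not state which null model was used, so the binomial assumption (each of the 100 interactions is reproducible independently with probability 867/3980) and the function name are illustrative choices only.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

reproducible, non_reproducible = 867, 3113
p = reproducible / (reproducible + non_reproducible)   # chance that a random link is reproducible
n = 100                                                # size of each reliability-ranked list

# Probability of seeing 5 or fewer reproducible interactions by chance.
p_low = sum(binomial_pmf(k, n, p) for k in range(0, 6))
# Probability of seeing 65 or more reproducible interactions by chance.
p_high = sum(binomial_pmf(k, n, p) for k in range(65, n + 1))

print(f"p_low = {p_low:.1e}, p_high = {p_high:.1e}")
```

Under this null model about 22 of every 100 randomly chosen interactions would be reproducible, so both observed counts lie far in the tails.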



Discussion 

We have shown that our network reconstruction method allows for a better characterization of network data sets, which will be particularly useful in data sets that we know contain many inaccuracies, such as protein interactomes. We have also shown that our approach reliably identifies missing and spurious interactions in complex networks, so that we can identify suspect interactions for further experimental probing.

Interestingly, our method can also guide new discoveries. If a given interaction between $i$ and $j$ truly exists but our approach predicts a very low reliability for the interaction (or vice versa), that means that the function of the interaction is very specific (since the interaction is rare among nodes that are otherwise similar to $i$ and $j$) and, therefore, functionally or evolutionarily important.



¶ Note that we do not advocate that this is the most appropriate procedure to analyze the structure of a protein interaction network (see [10] for a detailed discussion). Rather, we use this procedure because it enables us to test whether our algorithm can separate the least reliable and most reliable interactions.
Finally, our approach is flexible enough to allow generalizations in several directions. Arguably the most important of these is the extension to arbitrarily sophisticated families of models. In particular, one could use models that are the "product" of a network model $M_N$ (probably a block model) and an error model $M_E$ that incorporates the relevant error structure (maybe another block model with non-uniform priors). The flexibility of our approach, along with its generality and its performance, will make it applicable to many areas where network data reliability is a source of concern.

Materials and Methods 

Outline of the reliability calculations.

Formally, a block model $M = (P, Q)$ is completely determined by the partition $P$ of nodes into groups and the matrix $Q$ of probabilities of linkage between groups, so that Eq. [2] in the main text can be rewritten as

$$p_{BM}(X = x \mid A^O) = \frac{1}{Z} \sum_{P \in \mathcal{P}} \int_{[0,1]^G} dQ \, p(X = x \mid P, Q) \, p_{BM}(A^O \mid P, Q) \, p(P, Q), \qquad [8]$$

where $\mathcal{P}$ is the space of all possible partitions of the network into groups, $G$ is the number of distinct group pairs, and $Z$ is a normalizing constant.

Within the family of stochastic block models, one can evaluate the likelihood of each model $M$ because the probability of any two nodes $i$ and $j$ being connected depends only on the groups to which they belong. We have that

$$p_{BM}(A^O \mid P, Q) = \prod_{\alpha \le \beta} Q_{\alpha\beta}^{\, l^O_{\alpha\beta}} (1 - Q_{\alpha\beta})^{r_{\alpha\beta} - l^O_{\alpha\beta}}, \qquad [9]$$

where $l^O_{\alpha\beta}$ is the number of links in $A^O$ between nodes in groups $\alpha$ and $\beta$ of $P$, and $r_{\alpha\beta}$ is the maximum number of such links (that is, the number of pairs of nodes such that one node is in $\alpha$ and the other is in $\beta$).

Note that, among all possible block models, there is at least one whose likelihood is 1, namely the block model in which each node is in a different block and each $Q_{\alpha\beta}$ is 1 or 0 depending on whether the corresponding nodes are connected or not. This model contributes to $p(X = x \mid A^O)$ much more than most other models. However, there is only one such model (or very few), whereas there are many models with, for example, four blocks. This "entropic" term prevents overfitting of the network by very detailed (and ultimately uninformative) block models.
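The likelihood in Eq. [9] and the perfectly fitting model described above can be checked on a toy example. The sketch below accumulates the likelihood pair by pair, which is equivalent to the product over group pairs; the four-node network and the function name are assumptions made for illustration.

```python
from math import exp, log

def block_log_likelihood(groups, Q, links):
    """log p(A^O | P, Q) of Eq. [9], accumulated over all pairs of nodes."""
    logp = 0.0
    n = len(groups)
    for i in range(n):
        for j in range(i + 1, n):
            q = Q[groups[i]][groups[j]]
            p_pair = q if (i, j) in links else 1.0 - q
            if p_pair == 0.0:
                return float("-inf")       # the model cannot generate this observation
            logp += log(p_pair)
    return logp

links = {(0, 1), (2, 3)}                    # toy observation with 4 nodes and 2 links

# One block per node, with Q equal to the adjacency matrix: likelihood 1.
singletons = [0, 1, 2, 3]
Q_exact = [[1.0 if (min(a, b), max(a, b)) in links else 0.0 for b in range(4)]
           for a in range(4)]
fit_exact = exp(block_log_likelihood(singletons, Q_exact, links))

# A single block with link density 2/6: a much coarser description.
fit_coarse = exp(block_log_likelihood([0, 0, 0, 0], [[1 / 3]], links))

print(fit_exact, fit_coarse)
```

The single-block model assigns the observation probability $(1/3)^2 (2/3)^4 = 16/729$, far below the exact fit, which is why the number of models of each kind matters in the sum of Eq. [8].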

Using that $p(A_{ij} = 1 \mid P, Q) = Q_{\sigma_i \sigma_j}$ (where $\sigma_i$ is the module of node $i$ in partition $P$) and assuming no prior knowledge about the models (that is, $p(P, Q) = \text{const.}$), one can use Eqs. [8] and [9] to obtain Eqs. [3]-[6] in the main text.
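The intermediate step in this calculation is a standard Beta-function integral, applied independently to each pair of groups. The following worked derivation is a sketch of how Eqs. [3] and [4] follow from Eqs. [8] and [9] under a uniform prior; the shorthand $l = l^O_{\alpha\beta}$ and $r = r_{\alpha\beta}$ is introduced here only for brevity.

```latex
% Beta integral for non-negative integers a and b:
\int_0^1 dq \, q^{a} (1-q)^{b} = \frac{a! \, b!}{(a+b+1)!}
  = \frac{1}{(a+b+1) \binom{a+b}{a}} .

% Each pair of groups in Eq. [9] contributes (a = l, b = r - l):
\int_0^1 dQ_{\alpha\beta} \, Q_{\alpha\beta}^{\,l} (1 - Q_{\alpha\beta})^{r-l}
  = \frac{1}{(r+1) \binom{r}{l}}
  = \exp\left\{ -\left[ \ln(r+1) + \ln \binom{r}{l} \right] \right\} ,

% so the product over all pairs of groups is exp[-H(P)] with H(P) as in Eq. [4].

% For X = (A_{ij} = 1), the pair of groups (\sigma_i, \sigma_j) carries one
% extra factor Q_{\sigma_i \sigma_j}, so a = l + 1 for that pair only:
\int_0^1 dq \, q^{\,l+1} (1-q)^{r-l}
  = \frac{(l+1)! \, (r-l)!}{(r+2)!}
  = \frac{l+1}{r+2} \cdot \frac{1}{(r+1) \binom{r}{l}} .

% Dividing by the normalization Z = \sum_P \exp[-H(P)] gives Eq. [3]:
R^L_{ij} = \frac{1}{Z} \sum_{P \in \mathcal{P}}
  \left( \frac{l^O_{\sigma_i \sigma_j} + 1}{r_{\sigma_i \sigma_j} + 2} \right)
  \exp[-\mathcal{H}(P)] .
```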

Metropolis estimation of link and network reliability. 

To estimate the link and network reliabilities given by Eqs. [3] and [5] we use the following procedure. We start by placing each of the $N$ nodes in a group, which we choose with uniform probability from a set of $N$ possible groups (that is, there are as many groups as nodes). In general, some of these $N$ groups will be empty after the initial node assignment.

At each step we select a random node and attempt to move it to a randomly selected group. This update scheme is appropriate because: (i) it results in an ergodic exploration of the space of possible partitions, and (ii) it satisfies detailed balance (since the probability of choosing a move and its reverse are identical). To decide whether we accept the move, we calculate the change $\Delta\mathcal{H}$ (Eq. [4]): if $\Delta\mathcal{H} < 0$, the change is automatically accepted; otherwise, the change is accepted with probability $\exp(-\Delta\mathcal{H})$.

The sampling procedure starts after an equilibration period, during which $\mathcal{H}$ decreases from an initial value to its equilibrium value. We sample the partition space by considering $S = 10^4$ partitions, each one separated from the previous one by a number of steps that is large enough for the two partitions to be reasonably uncorrelated (as measured by the mutual information between partitions).
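The procedure above can be sketched as follows for small networks. This is an illustrative simplification, not the implementation used for the results: it recomputes $\mathcal{H}$ from scratch at every step, uses a fixed burn-in and a fixed thinning interval instead of monitoring equilibration and mutual information, and draws far fewer samples. All function names, default parameters, and the toy network are assumptions.

```python
import random
from math import exp, lgamma, log

def log_binom(n, k):
    """Logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def counts(labels, links):
    """Observed links l and possible links r for each pair of groups."""
    l, r = {}, {}
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            a, b = labels[i], labels[j]
            key = (a, b) if a <= b else (b, a)
            r[key] = r.get(key, 0) + 1
            if (i, j) in links:
                l[key] = l.get(key, 0) + 1
    return l, r

def energy(l, r):
    """H(P) of Eq. [4]."""
    return sum(log(r[k] + 1) + log_binom(r[k], l.get(k, 0)) for k in r)

def metropolis_link_reliability(n, links, n_samples=200, thin=None, burn_in=None, seed=0):
    """Estimate Eq. [3] for every pair of nodes by Metropolis sampling of partitions."""
    rng = random.Random(seed)
    thin = thin or n                                   # steps between samples (assumption)
    burn_in = 50 * n if burn_in is None else burn_in   # fixed equilibration (assumption)
    labels = [rng.randrange(n) for _ in range(n)]      # n nodes, n possible groups
    l, r = counts(labels, links)
    H = energy(l, r)
    R = {(i, j): 0.0 for i in range(n) for j in range(i + 1, n)}
    for step in range(1, burn_in + n_samples * thin + 1):
        node, new_group = rng.randrange(n), rng.randrange(n)
        old_group = labels[node]
        labels[node] = new_group
        l_new, r_new = counts(labels, links)
        H_new = energy(l_new, r_new)
        dH = H_new - H
        if dH <= 0 or rng.random() < exp(-dH):
            l, r, H = l_new, r_new, H_new              # accept the move
        else:
            labels[node] = old_group                   # reject and restore
        if step > burn_in and (step - burn_in) % thin == 0:
            for (i, j) in R:
                a, b = labels[i], labels[j]
                key = (a, b) if a <= b else (b, a)
                R[(i, j)] += (l.get(key, 0) + 1) / (r[key] + 2)
    return {pair: total / n_samples for pair, total in R.items()}

# Toy observation: two 4-node cliques joined by one link, with link (0, 1) removed.
clique_a = {(i, j) for i in range(4) for j in range(i + 1, 4)}
clique_b = {(i, j) for i in range(4, 8) for j in range(i + 1, 8)}
observed = (clique_a | clique_b | {(3, 4)}) - {(0, 1)}
R = metropolis_link_reliability(8, observed)
print(R[(0, 1)], R[(0, 7)])   # removed within-clique pair vs. a between-clique pair
```

A practical implementation would update only the counts affected by the moved node rather than recomputing them for the whole network at every step.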

Because the link and network reliabilities are ensemble averages over independent partitions, it is straightforward to parallelize the algorithm so that the partitions are obtained concurrently. Therefore, given enough computational resources, the reliabilities can be calculated even for large networks (probably up to millions of nodes) in relatively short times.

Benchmark algorithms for the identification of missing and spurious interactions.

The hierarchical random graph approach is described in detail in [11]. We use the implementation provided by the authors (available at http://www.santafe.edu/~aaronc/hierarchy/hrg.20080819.predictHRG.v1.0.S.tgz), which we modified slightly to be able to study spurious as well as missing interactions.

We analyze three local algorithms: common neighbors, degree product, and Jaccard index (the last two in Supporting Information only). For each of these algorithms, the link "reliability" $R^L_{ij}$ is defined as follows (note that, for these approaches, the "reliability" is not a probability but just a score that enables us to rank node pairs):

• Common neighbors: $R^L_{ij} = \|\Gamma_i \cap \Gamma_j\|$, where $\Gamma_i$ is the set of neighbors of node $i$, and $\|\cdots\|$ indicates the number of nodes in a set.

• Degree product: $R^L_{ij} = \|\Gamma_i\| \times \|\Gamma_j\|$.

• Jaccard index: $R^L_{ij} = \|\Gamma_i \cap \Gamma_j\| / \|\Gamma_i \cup \Gamma_j\|$.
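The three local scores listed above can be computed directly from the neighbor sets. In the sketch below, the function names and the toy network are illustrative, and assigning a Jaccard index of 0 to two isolated nodes is an assumption, since the definition leaves that case open.

```python
def neighbor_sets(n, links):
    """Return, for each node, the set of its neighbors."""
    nbrs = [set() for _ in range(n)]
    for i, j in links:
        nbrs[i].add(j)
        nbrs[j].add(i)
    return nbrs

def local_scores(n, links):
    """Common neighbors, degree product, and Jaccard index for every pair of nodes."""
    nbrs = neighbor_sets(n, links)
    scores = {}
    for i in range(n):
        for j in range(i + 1, n):
            common = len(nbrs[i] & nbrs[j])
            union = len(nbrs[i] | nbrs[j])
            scores[(i, j)] = {
                "common_neighbors": common,
                "degree_product": len(nbrs[i]) * len(nbrs[j]),
                "jaccard": common / union if union else 0.0,   # 0 for isolated pairs (assumption)
            }
    return scores

# Toy network: a triangle (0, 1, 2) with a pendant node 3 attached to node 2.
links = {(0, 1), (0, 2), (1, 2), (2, 3)}
scores = local_scores(4, links)
print(scores[(0, 3)])
```

For the unconnected pair (0, 3), node 2 is the only common neighbor, the degrees are 2 and 1, and the union of the neighbor sets has two nodes, so the Jaccard index is 1/2.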

Heuristic network reconstruction. 

The goal of the heuristic network reconstruction algorithm is to find $A^R = \arg\max_A R^N_A$, where $R^N_A = p(A \mid A^O)$ is the reliability of network $A$ given observation $A^O$. Since exhaustive maximization of $R^N_A$ is not possible, we use the following heuristic method. Start by evaluating the link reliabilities $R^L_{ij}$ for all pairs of nodes in $A^O$; sort observed links ($A^O_{ij} = 1$) by increasing reliability and observed non-links ($A^O_{ij} = 0$) by decreasing reliability. Then choose pairs of link/non-link in order: remove the link (which has a low reliability) and add the non-link (which has a high reliability), and accept the change if, and only if, $R^N_A$ increases. Repeat this procedure, going down the lists, until we reject five consecutive attempts to swap a link/non-link pair. At this point, reevaluate $R^L_{ij}$ and repeat the process. The algorithm stops when no link swaps are accepted.
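The swap heuristic described above can be sketched with the two reliability estimates passed in as functions. In this sketch, `link_reliability` and `network_score` are hypothetical callables standing in for Metropolis estimates of Eq. [3] and of (the logarithm of) Eq. [5]; the toy oracle used in the example is made up purely to exercise the control flow.

```python
def reconstruct(observed, n, link_reliability, network_score, max_rejections=5):
    """Greedy link/non-link swaps that are kept only if the network score increases."""
    current = set(observed)
    best = network_score(current)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    while True:
        R = link_reliability(current)                 # reevaluate all link reliabilities
        present = sorted((p for p in pairs if p in current), key=lambda p: R[p])
        absent = sorted((p for p in pairs if p not in current), key=lambda p: -R[p])
        accepted, rejected = 0, 0
        for old, new in zip(present, absent):         # least reliable link, most reliable non-link
            trial = (current - {old}) | {new}
            score = network_score(trial)
            if score > best:
                current, best = trial, score
                accepted, rejected = accepted + 1, 0
            else:
                rejected += 1
                if rejected == max_rejections:        # five consecutive rejections
                    break
        if accepted == 0:                             # no swap accepted in a full pass
            return current

# Toy oracle (illustration only): the scores know a hidden target network.
target = {(0, 1), (1, 2), (2, 3)}
observed = {(0, 1), (1, 2), (0, 3)}                   # one link misplaced
all_pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]

def toy_link_reliability(network):
    return {p: 1.0 if p in target else 0.0 for p in all_pairs}

def toy_network_score(network):
    return -len(network ^ target)                     # fewer differences is better

reconstruction = reconstruct(observed, 4, toy_link_reliability, toy_network_score)
print(sorted(reconstruction))
```

Because every accepted swap strictly increases the score and the number of possible networks is finite, the loop is guaranteed to terminate.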
Fig. 1. Stochastic block models. A stochastic block model is fully specified by a partition of nodes into groups and a matrix $Q$ in which each element $Q_{\alpha\beta}$ represents the probability that a node in group $\alpha$ connects to a node in group $\beta$. A, A simple matrix of probabilities $Q$. Nodes are divided into three groups (which contain 4, 5, and 6 nodes, respectively) and are represented as squares, circles, and triangles depending on their group. The value of each element $Q_{\alpha\beta}$ is indicated by the shade of gray; for example, squares do not connect to other squares, and connect to triangles with small probability, but squares connect to circles with high probability. B, A realization of the model in A. In this realization, the number of links between the square and the triangle group is $l_{\square\triangle} = 4$, whereas the maximum possible number of links between these groups is $r_{\square\triangle} = 24$.
Fig. 2. Identification of missing and spurious links. We compare the approach presented here (black circles) to the approach of Clauset et al. [11] (white squares) and to a local algorithm based on the number of common neighbors between pairs of nodes [29, 11] (white triangles) (Methods; see Supporting Information for a comparison to other local algorithms). A-E, Missing links. For each true network $A^T$ we remove a fraction $f$ of its links to generate an observed network $A^O$, calculate the link reliability $R^L_{ij}$ for each pair of nodes, and rank pairs of nodes in order of decreasing reliability. Accuracy is calculated as the probability that a false negative (one of the links we removed, that is, $A^O_{ij} = 0$ but $A^T_{ij} = 1$) has a higher ranking than a true negative ($A^O_{ij} = 0$ and $A^T_{ij} = 0$). The dashed line indicates the baseline accuracy when false negatives and true negatives are randomly ranked. F-J, Spurious links. For each true network we add a fraction $f$ of links to generate an observed network $A^O$, calculate the link reliability $R^L_{ij}$ for each pair of nodes, and rank pairs of nodes in order of decreasing reliability. Accuracy is calculated as the probability that a false positive (one of the links we added, that is, $A^O_{ij} = 1$ but $A^T_{ij} = 0$) has a lower ranking than a true positive ($A^O_{ij} = 1$ and $A^T_{ij} = 1$). The dashed line indicates the baseline accuracy when false positives and true positives are randomly ranked. Scores for algorithms other than the present approach are obtained as described earlier [11] (Methods) and ties are randomly broken when necessary.
Fig. 3. Reconstruction of the air transportation network of Eastern Europe. A, The true air transportation network. The area of each node is proportional to its betweenness centrality, with Moscow being the most central node in the network. B, The observed air transportation network, which we build by randomly removing 20% of the real links and replacing them by random links. C, The reconstructed air transportation network that we obtain from the observed network by applying the heuristic reconstruction method described in the text and Methods. For clarity, in B (respectively C) we do not depict the correct links, but only: (i) missing links in orange, which exist in the true network but not in the observation (reconstruction), and (ii) spurious links in blue, which do not exist in the true network but do exist in the observation (reconstruction). As in A, the area of each node is proportional to its betweenness centrality, with the black circle representing the true betweenness centrality of each node. The color of each node represents the relative error in the degree of the node, with respect to the true degree. The observed network contains 60 missing and 60 spurious links, whereas the reconstruction only contains 52 of each (a 13% improvement). In general, node degree and betweenness centrality are also better captured in the reconstruction.
Fig. 4. Properties of observations and reconstructions of the air transportation network of Eastern Europe. In each case, the observation $A^O$ is generated from the true network $A^T$ by randomizing a fraction $f$ of its links. The reconstruction $A^R$ is generated from $A^O$ as described in the text and Methods. For each property $X$, we calculate the relative error of the observation $(X(A^O) - X(A^T))/X(A^T)$ (black circles) and of the reconstruction $(X(A^R) - X(A^T))/X(A^T)$ (white squares). Symbols represent the mean over 25 repetitions, and the error bars indicate the standard error of the mean. The shaded region corresponds to the region with smaller relative error (in absolute value) than the observation, so that squares within the shaded region correspond to reconstructions that provide better network-property estimates than the observation itself. A-C, Static properties: A, Clustering coefficient [30]; B, Modularity [30]; and C, Assortativity [30]. D-F, Dynamic properties: D, Transportation congestability, that is, the maximum betweenness centrality in the network [31]; E, Synchronizability, that is, the ratio between the largest eigenvalue and the smallest non-zero eigenvalue of the Laplacian matrix of the network [32]; F, Spreading threshold, that is, the ratio between the first and the second moments of the degree distribution [33].